This project applies predictive modeling techniques to analyze cardiovascular health, focusing on two key objectives: predicting maximum heart rate achieved (thalach) and classifying the presence of heart disease. The analysis begins with an exploratory data assessment to understand variable distributions and relationships before implementing a range of regression and classification models.
For thalach prediction, multiple regression approaches were explored, including linear regression, Lasso regression, and regression trees. Among these, the tuned regression tree model emerged as the most effective, achieving the lowest RMSE (14.56) and MAE (11.28) and the highest R² (0.56), significantly outperforming linear models, which explained only 39% of the variance. The Variable Importance Plot (VIP) revealed that ST segment slope (slope), age, and ST depression (oldpeak) were the most influential predictors of thalach, reinforcing their significance in cardiovascular assessments.
For heart disease classification, both logistic regression and classification trees were evaluated. The classification trees outperformed logistic regression, achieving 93% accuracy, 91% sensitivity, and 95% specificity and precision. The ROC curve confirmed this distinction, with the classification tree achieving an AUC of 0.98. The Variable Importance Plot (VIP) identified chest pain type (cp), thalassemia status (thal), and maximum heart rate (thalach) as the most critical factors in predicting heart disease, aligning with established medical risk indicators.
A key observation was that adjusting the classification threshold to 0.59 had minimal impact, as most predicted probabilities remained below this value, leading to unchanged classification results. This highlighted the importance of careful threshold selection and model tuning to optimize classification performance.
In conclusion, this project demonstrates the effectiveness of regression tree models for predicting maximum heart rate and classification trees for identifying heart disease. The findings emphasize the importance of feature selection, model complexity, and cutoff optimization in improving predictive accuracy and clinical relevance.
According to the CDC, one person dies every 33 seconds from cardiovascular disease. Heart disease poses a significant health risk, making it crucial to identify its risk factors to implement proactive measures before it becomes critical. This report seeks to identify the key risk factors contributing to heart disease through an analysis of a dataset. This data collection is dated 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V, each representing individual patient records. This dataset is designed to help predict the presence of heart disease based on various medical, demographic, and diagnostic information about the patients.
For this analysis, I will first examine the distribution of the variables and look for relationships. Second, I will conduct regression analysis to predict maximum heart rate achieved (thalach), as it is a key indicator of cardiovascular fitness and heart function. This will include a variety of methods, such as linear regression, lasso regression and regression trees to determine the most significant predictors and identify the most effective model for understanding how different factors influence thalach. Third, I will perform logistic regression using backward elimination and classification trees to predict the presence of heart disease. This is a classification task, where the dependent variable (target) indicates whether heart disease is present or absent based on multiple attributes. Finally, I will end with summarizing the conclusions and reflections.
The Data

This dataset has 1025 rows and 14 variables.
Data Sources

The dataset used is from Kaggle (2019), titled Heart Disease Dataset, provided by David Lapp.
This dataset contains a variety of medical, demographic, and diagnostic attributes that help in predicting heart disease. I converted 8 categorical variables (sex, cp, fbs, restecg, exang, slope, ca, and thal) to factors.
Within the dataset, we can see that the age of patients ranges from 29 to 77 years, with a mean of approximately 54 years, reflecting a predominantly middle-aged population. Additionally, there is a notable gender disparity, with 713 male patients compared to 312 female patients. Several critical cardiovascular indicators exhibit wide variability. For instance, ST depression (oldpeak), which measures heart stress during exercise, ranges from 0 to 6.2, with higher values indicating more severe heart conditions. The maximum heart rate achieved (thalach) spans from 71 to 202 beats per minute, with a median of 152 bpm.
Looking at the overall distribution of heart disease cases, 526 out of 1025 patients (51.3%) have a heart condition, suggesting that more than half of the individuals in the dataset are affected. This highlights the importance of analyzing key risk factors that contribute to heart disease.
Below is the range of values for each variable.
```
      age          sex     cp      trestbps          chol     fbs     restecg
 Min.   :29.00   0:312   0:497   Min.   : 94.0   Min.   :126   0:872   0:497
 1st Qu.:48.00   1:713   1:167   1st Qu.:120.0   1st Qu.:211   1:153   1:513
 Median :56.00           2:284   Median :130.0   Median :240           2: 15
 Mean   :54.43           3: 77   Mean   :131.6   Mean   :246
 3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:275
 Max.   :77.00                   Max.   :200.0   Max.   :564

    thalach      exang       oldpeak      slope     ca      thal    target
 Min.   : 71.0   0:680   Min.   :0.000   0: 74   0:578   0:  7   0:499
 1st Qu.:132.0   1:345   1st Qu.:0.000   1:482   1:226   1: 64   1:526
 Median :152.0           Median :0.800   2:469   2:134   2:544
 Mean   :149.1           Mean   :1.072           3: 69   3:410
 3rd Qu.:166.0           3rd Qu.:1.800           4: 18
 Max.   :202.0           Max.   :6.200
```

```
     age      sex       cp trestbps     chol      fbs  restecg  thalach
       0        0        0        0        0        0        0        0
   exang  oldpeak    slope       ca     thal   target
       0        0        0        0        0        0
```
There are no missing values.
Patients with heart disease slightly outnumber those without heart disease (51.3% vs 48.7%).
Relation between Age and Heart Disease: Heart disease (target = 1) appears to affect individuals across a broader age range, with a tendency toward younger ages compared to those without heart disease. On the other hand, the absence of heart disease (target = 0) is more common among older individuals in this dataset.
Relation between Gender and Heart Disease: Heart disease is observed in both genders but is more frequent among males. While females have a smaller population in the dataset, the proportion with heart disease seems relatively high compared to males. This suggests that heart disease affects both genders significantly but may impact males more in this dataset.
Relation between Sex, Age and Heart Disease: Among patients without heart disease (target = 0), there is a relatively even spread of males and females, though males appear slightly more frequent. However, for patients with heart disease (target = 1), males (blue) are much more prevalent than females (red), suggesting that men in this dataset are at a higher risk of developing heart disease. Additionally, age does not show a clear separation between those with and without heart disease, as patients with heart disease span a broad age range, indicating that age alone may not be the strongest predictor.
Relation between Thalach, Age and Heart Disease: Higher maximum heart rates are associated more frequently with the presence of heart disease (target = 1), and younger individuals may achieve higher heart rates compared to older individuals.
The correlation matrix indicates that chest pain type (cp) and maximum heart rate achieved (thalach) have the strongest positive correlations with target, making them critical predictors of heart disease. Conversely, oldpeak (ST depression), number of major vessels (ca) and exercise-induced angina (exang) show strong negative correlations with target, highlighting their importance in predicting heart disease absence. There are no signs of multicollinearity among the independent variables, as no correlations between predictors exceed the threshold of ±0.6, ensuring stable regression models.
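The multicollinearity screen described above can be sketched in Python (the report's own analysis was done in R). The data frame below is synthetic, not the heart-disease data; the ±0.6 threshold is the one stated in the text:

```python
import numpy as np
import pandas as pd

# Illustrative multicollinearity check on synthetic data; the column
# names mirror the report's predictors but the values are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(54, 9, 200),
    "trestbps": rng.normal(131, 17, 200),
    "chol": rng.normal(246, 51, 200),
})

corr = df.corr()
# Flag any predictor pair whose |r| exceeds the +/-0.6 threshold.
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.6
]
print(pairs)  # an empty list means no multicollinearity concern
```

With independently simulated columns the flagged list comes back empty, matching the report's finding that no predictor pair crosses the threshold.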
The histogram displays the distribution of maximum heart rate achieved (thalach) across the dataset. The distribution appears slightly left-skewed (the mean of 149 bpm falls below the median of 152 bpm), with the majority of values concentrated between 100 and 180 bpm and a peak around 150 bpm, indicating that most individuals in the dataset reach a heart rate in this range. There are relatively fewer cases at the extremes. The presence of multiple peaks suggests some variability in the data, potentially due to differences in age, fitness level, or health conditions among the patients.
```
	Welch Two Sample t-test

data:  thalach by target
t = -14.862, df = 976.86, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -22.02427 -16.88631
sample estimates:
mean in group 1 mean in group 2
       139.1303        158.5856
```
The extremely small p-value suggests that the difference in means is highly statistically significant. This means the observed difference is unlikely to be due to random chance and likely represents a real effect in the population.
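As an illustration of the Welch test reported above (the analysis itself was run in R), here is a sketch in Python on simulated thalach values whose group means roughly match the report; the group standard deviations are assumptions:

```python
import numpy as np
from scipy import stats

# Simulated thalach values for the two target groups; means mirror the
# report (about 139 vs 159 bpm), SDs are assumed for illustration.
rng = np.random.default_rng(1)
group_low = rng.normal(139, 23, 500)    # lower-thalach group
group_high = rng.normal(159, 19, 520)   # higher-thalach group

# equal_var=False requests Welch's correction, matching the R output.
t_stat, p_value = stats.ttest_ind(group_low, group_high, equal_var=False)
print(round(t_stat, 2), p_value < 0.001)
```

With a mean gap of about 20 bpm over roughly 500 observations per group, the test statistic is strongly negative and the p-value is far below any conventional significance level, just as in the report.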
For predicting the continuous variable thalach, we will use linear regression. The results are summarized in this section.
After examining the final model, we can observe some issues in the residual plots that indicate potential concerns with our data. The homogeneity-of-variance plot shows a curved pattern, suggesting that the residuals are not evenly distributed across the fitted values, meaning the model may not capture the variability in thalach equally for all predictions. The normality-of-residuals plot indicates some deviation from normality, particularly in the tails, which may impact the validity of conclusions.
Additionally, comparing the full and pruned models, we see that removing predictors with high p-values did not significantly improve the fit, as the R² remained relatively low (around 0.39), and the RMSE and MAE values remained nearly the same. This suggests that neither the full linear model nor the pruned model is the best fit for predicting thalach, and alternative approaches such as tree-based methods may improve predictive accuracy.
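For reference, the three comparison metrics used throughout this section (RMSE, MAE, R²) can be computed as follows; the actual and predicted values here are toy numbers, not the report's:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy actual vs. predicted thalach values, purely for illustration.
actual = np.array([150.0, 140.0, 170.0, 120.0, 160.0])
predicted = np.array([148.0, 145.0, 162.0, 128.0, 155.0])

rmse = mean_squared_error(actual, predicted) ** 0.5  # root mean squared error
mae = mean_absolute_error(actual, predicted)         # mean absolute error
rsq = r2_score(actual, predicted)                    # variance explained
print(round(rmse, 2), round(mae, 2), round(rsq, 2))
```

RMSE penalizes large errors more heavily than MAE, which is why the two can rank models differently; R² reports the share of variance in the response the model explains.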
Effect on Thalach by the Predictor Variables

| Variable | Direction |
|---|---|
| age | Decrease |
| cp | Increase |
| trestbps | Increase |
| chol | Increase |
| exang | Decrease |
| slope | Increase |
| target | Increase |
We can see that our initial linear model achieves an R-squared of 39%, indicating that it explains a small to moderate portion of the variability in the response variable. Analyzing the residual plots suggests that the residuals are mostly normal, though there is a slight curvature in the residual vs. fitted values plot and some skew in the residual distribution.
After examining the model coefficients, we find that some predictors do not significantly contribute to predicting thalach (p-values > 0.05). Consequently, we create a pruned model by removing these less significant predictors. This pruning step aims to simplify the model, potentially improving interpretability and overall performance.
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model Full | 17.89 | 13.95 | 0.39 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 140.22 | 9.05 | 15.50 | 0.00 |
| age | -0.84 | 0.07 | -12.05 | 0.00 |
| sex | 0.75 | 1.36 | 0.55 | 0.58 |
| cp | 2.35 | 0.64 | 3.66 | 0.00 |
| trestbps | 0.12 | 0.03 | 3.59 | 0.00 |
| chol | 0.03 | 0.01 | 2.92 | 0.00 |
| fbs | 2.30 | 1.65 | 1.40 | 0.16 |
| restecg | -1.24 | 1.10 | -1.13 | 0.26 |
| exang | -9.32 | 1.41 | -6.62 | 0.00 |
| oldpeak | -0.73 | 0.63 | -1.15 | 0.25 |
| slope | 8.07 | 1.15 | 7.04 | 0.00 |
| ca | -0.54 | 0.62 | -0.87 | 0.39 |
| thal | 1.96 | 0.98 | 1.99 | 0.05 |
| target | 7.60 | 1.60 | 4.76 | 0.00 |
The analysis of the pruned model reveals potential concerns with heteroscedasticity and non-linearity, which may impact the model’s reliability. The homogeneity-of-variance plot (residuals vs. fitted values) shows a curved trend line rather than a flat horizontal one, suggesting that the variance of residuals is not constant across predicted values. This heteroscedasticity implies that the model may perform better for some ranges of fitted values than others, leading to unreliable predictions, and it suggests that the relationship between the predictors and the dependent variable cannot be fully captured by a simple linear model. On the other hand, the normality-of-residuals plot indicates that residuals are approximately normal with some skew. As the pruned model removes unnecessary predictors, it might be preferable for simplicity and interpretability.
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model Full | 17.894 | 13.951 | 0.394 |
| Linear Pruned Final Model | 17.980 | 14.036 | 0.389 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 145.18 | 7.03 | 20.66 | 0 |
| age | -0.85 | 0.07 | -12.59 | 0 |
| cp | 2.50 | 0.64 | 3.93 | 0 |
| trestbps | 0.13 | 0.03 | 3.80 | 0 |
| chol | 0.04 | 0.01 | 3.16 | 0 |
| exang | -9.06 | 1.39 | -6.51 | 0 |
| slope | 8.67 | 0.99 | 8.75 | 0 |
| target | 7.23 | 1.40 | 5.16 | 0 |
After evaluating the data using linear regression, there were potential issues with non-linearity and heteroscedasticity, which affected the model’s predictive performance. To better capture complex relationships between predictors and the target variable, I implemented a lasso regression model with a penalty of 0.1 and a tuned lasso regression model with a train/test split.
The analysis of the Lasso regression model highlights the impact of regularization strength on model performance. Excessive penalization (lambda values of 10 or 250) led to over-regularization. By adjusting the penalty to 0.1, the model retained important predictors while still mitigating overfitting. The residual vs. predicted plot indicates that while the model performs reasonably well, there may be some heteroscedasticity, as residuals are not uniformly distributed across predicted values. This suggests that certain ranges of the target variable may be predicted with greater accuracy than others, potentially impacting reliability.
The final tuned lasso model with a train/test split produces similar metrics, but since the model with λ = 0.1 has a higher R² while maintaining comparable RMSE and MAE, it is the better choice. This model balances regularization and predictive power, making it preferable to the more constrained λ = 1 model, which shrinks coefficients more aggressively and slightly reduces prediction accuracy.
Effect on Thalach by the Predictor Variables by Lasso Regression (Penalty 0.1)

| Variable | Direction |
|---|---|
| age | Decrease |
| sex | Increase |
| cp | Increase |
| trestbps | Increase |
| chol | Increase |
| fbs | Increase |
| restecg | Decrease |
| exang | Decrease |
| oldpeak | Decrease |
| slope | Increase |
| ca | Decrease |
| thal | Increase |
| target | Increase |
We can see that our Lasso regression model with a penalty of 0.1 achieves an R-squared of 39%, indicating that it explains a small to moderate portion of the variability in the response variable. Analyzing the residual plots suggests that while the residuals are mostly normal, there is some heteroscedasticity present, as the residuals are not uniformly distributed across predicted values.
Examining the model coefficients reveals that some predictors contribute more significantly than others, with certain variables having been shrunk close to zero due to the Lasso penalty. Initially, using a penalty of 10 or 250 caused the model to be overly restrictive, shrinking all coefficients excessively and resulting in nearly identical predictions. This indicated excessive regularization, leading to a loss of meaningful relationships between predictors and the target variable.
To address this issue, a lower penalty of 0.1 was selected, allowing the model to retain important predictors while still applying regularization to mitigate overfitting. This adjustment balances interpretability and predictive performance.
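A minimal sketch of how the penalty controls shrinkage, using scikit-learn's Lasso on synthetic data (the report's models were fit in R; coefficient scales here are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data with one strong and one weak predictor, not the
# heart-disease dataset; used only to show the shrinkage effect.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = 10 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300)
X = StandardScaler().fit_transform(X)

for alpha in (0.1, 10.0):
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs, 2))
# A mild penalty keeps both effects; a large one shrinks coefficients
# toward (or exactly to) zero, mirroring the over-regularized
# lambda = 10 / 250 fits described above.
```

The large penalty zeroes out the weak predictor entirely and all but erases the strong one, which is exactly the "nearly identical predictions" symptom the report observed before lowering the penalty to 0.1.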
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model Full | 17.894 | 13.951 | 0.394 |
| Linear Pruned Final Model | 17.980 | 14.036 | 0.389 |
| Lasso Regression Penalty 0.1 | 17.897 | 13.948 | 0.394 |
The Lasso regression model with a penalty of 0.1 retains key predictors while applying regularization to prevent overfitting. Among the most influential variables, age (-7.47) and exang (-4.32) show strong negative relationships, indicating that an increase in age and exercise-induced angina is associated with a lower predicted outcome. In contrast, cp (2.41) and thal (1.07) have positive coefficients, suggesting that chest pain type and thalassemia classification contribute to higher predictions.
| term | estimate | penalty |
|---|---|---|
| (Intercept) | 149.11 | 0.1 |
| age | -7.47 | 0.1 |
| sex | 0.20 | 0.1 |
| cp | 2.41 | 0.1 |
| trestbps | 2.05 | 0.1 |
| chol | 1.62 | 0.1 |
| fbs | 0.72 | 0.1 |
| restecg | -0.57 | 0.1 |
| exang | -4.32 | 0.1 |
| oldpeak | -0.81 | 0.1 |
| slope | 4.95 | 0.1 |
| ca | -0.46 | 0.1 |
| thal | 1.07 | 0.1 |
| target | 3.70 | 0.1 |
During the tuning process for the Lasso regression model, I initially encountered an issue where all predictions were constant, leading to errors in computing correlation-based metrics. This occurred because the penalty (lambda) was too high, causing excessive shrinkage of the coefficients and reducing model variability. By adjusting the penalty range from 0.00001 to 0.1, I allowed the model to retain more important predictors while still applying regularization to prevent overfitting. To ensure a robust evaluation, I implemented a train-test split (80% and 20%). This approach prevented data leakage and provided a realistic assessment of the model’s performance on unseen data. I selected the penalty value corresponding to the lowest MAE to optimize the model’s predictive performance. The Mean Absolute Error (MAE) represents the average absolute difference between predicted and actual values, making it a crucial metric for assessing model accuracy. I ensured that the final model was tuned to minimize prediction errors while maintaining regularization.
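The tuning loop described in this paragraph — an 80/20 train/test split, a grid of candidate penalties, and selection of the penalty with the lowest test-set MAE — can be sketched as follows; the data and penalty grid are illustrative, not the report's:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Simulated regression data standing in for the heart-disease predictors.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=500)

# 80/20 split, as in the report, to avoid leakage into the evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

penalties = [0.00001, 0.001, 0.01, 0.1, 1.0]
maes = {p: mean_absolute_error(y_te, Lasso(alpha=p).fit(X_tr, y_tr).predict(X_te))
        for p in penalties}
best = min(maes, key=maes.get)  # keep the penalty with the lowest test MAE
print(best, round(maes[best], 3))
```

Selecting by held-out MAE rather than training error is what makes the assessment realistic for unseen data, as the paragraph above argues.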
```
# A tibble: 1 × 2
  penalty .config
    <dbl> <chr>
1    1.00 Preprocessor1_Model01
```
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model Full | 17.894 | 13.951 | 0.394 |
| Linear Pruned Final Model | 17.980 | 14.036 | 0.389 |
| Lasso Regression Penalty 0.1 | 17.897 | 13.948 | 0.394 |
| Lasso Regression Tuned | 17.518 | 13.732 | 0.353 |
The Tuned Lasso regression model applied regularization, reducing some coefficients to zero, indicating they were not strong predictors. Key influential variables include age (-6.299) and exang (-3.717), which negatively impact the target, while slope (4.537) and target (3.196) have strong positive effects. We can see this in the Variable Importance Plot.
| term | estimate | penalty |
|---|---|---|
| (Intercept) | 149.114 | 1 |
| age | -6.299 | 1 |
| sex | 0.000 | 1 |
| cp | 2.062 | 1 |
| trestbps | 0.947 | 1 |
| chol | 0.593 | 1 |
| fbs | 0.000 | 1 |
| restecg | 0.000 | 1 |
| exang | -3.717 | 1 |
| oldpeak | -0.424 | 1 |
| slope | 4.537 | 1 |
| ca | 0.000 | 1 |
| thal | 0.000 | 1 |
| target | 3.196 | 1 |
After evaluating the data using linear and lasso regression, there were potential issues with heteroscedasticity, which affected the models’ predictive performance. To better capture complex relationships between predictors and the target variable, I implemented a regression tree model and a tuned regression tree model, which can automatically detect interactions and split the data into meaningful segments, improving prediction accuracy.
Initially, the regression tree yielded improved performance over the linear models. It achieved an RMSE of 15.50 and an R² of 0.55, compared to the full linear model’s RMSE of 17.89 and R² of 0.39, with the lasso models performing similarly. This suggests that the tree model captures more variation in the data than linear regression.
However, to further enhance performance, I tuned the regression tree with training/test split (80% and 20%) by optimizing the cost complexity parameter and tree depth using cross-validation. The tuned regression tree model significantly outperformed the previous models, with an RMSE of 14.56 an MAE of 11.28, and an R² of 0.56. The actual vs. predicted scatter plots show a stronger correlation in the tuned tree, meaning it generalizes better. By refining the model through tuning, we achieved higher predictive power while maintaining interpretability.
I will predict thalach with all the variables. The regression tree model shows better performance than the linear and lasso models.
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model Full | 17.894 | 13.951 | 0.394 |
| Linear Pruned Final Model | 17.980 | 14.036 | 0.389 |
| Lasso Regression Penalty 0.1 | 17.897 | 13.948 | 0.394 |
| Lasso Regression Tuned | 17.518 | 13.732 | 0.353 |
| Regression Tree Model | 15.504 | 12.140 | 0.545 |
The regression tree has 15 leaf nodes. The variable importance plot shows that the top three most important features are slope (most influential), age, and oldpeak.
To see if tuning improves performance, I will use cross-validation on the cost complexity and the tree depth. After tuning (39 leaf nodes, adjusted complexity parameters), the regression tree achieved the lowest RMSE and highest R², demonstrating that the model now explains 56% of the variance in the target variable.
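A sketch of the cross-validated tuning just described, using scikit-learn's cost-complexity parameter (`ccp_alpha`, the analogue of rpart's `cp`) and tree depth; the data are simulated, not the report's:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a step effect plus a linear effect, which a tree
# can exploit; stands in for the heart-disease predictors.
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(600, 3))
y = np.where(X[:, 0] > 0, 160.0, 130.0) - 5 * X[:, 1] + rng.normal(0, 5, 600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validate over cost complexity and depth, scoring by RMSE.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"ccp_alpha": [1e-10, 1e-3, 1e-1], "max_depth": [3, 6, 9]},
    cv=5, scoring="neg_root_mean_squared_error",
).fit(X_tr, y_tr)
print(grid.best_params_)
```

The grid mirrors the two knobs the report tunes: a near-zero cost complexity lets the tree grow, while depth caps how many successive splits it may take.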
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model Full | 17.894 | 13.951 | 0.394 |
| Linear Pruned Final Model | 17.980 | 14.036 | 0.389 |
| Lasso Regression Penalty 0.1 | 17.897 | 13.948 | 0.394 |
| Lasso Regression Tuned | 17.518 | 13.732 | 0.353 |
| Regression Tree Model | 15.504 | 12.140 | 0.545 |
| Tuned Regression Tree Model | 14.557 | 11.279 | 0.557 |
```
Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = 1e-10
  tree_depth = 6

Computational engine: rpart

Model fit template:
rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
    cp = 1e-10, maxdepth = 6L)
```
The regression tree has 39 leaf nodes. Slope, age, and oldpeak are the dominant predictors for thalach, suggesting these factors have a strong relationship with a person’s maximum heart rate during exercise. The presence of heart disease (target) also plays a role, reinforcing the medical link between cardiovascular conditions and heart rate response.
For the final model, I will use logistic regression to explore heart disease presence. From the variable importance plot, we can see that males, number of major vessels (ca2, ca3, ca4), chest pain type (cp2, cp3, cp4), and ST depression (oldpeak) are among the most significant predictors in the model. These variables play a critical role in determining heart disease risk, aligning with medical insights that suggest factors like chest pain, blood vessel count, and exercise-induced abnormalities are strong indicators of cardiovascular health. The ROC curve confirms that our logistic model performs well, with an AUC of 0.91, and adjusting the cutoff to 0.74 provides a balance between sensitivity (87%) and specificity (82%).
From the logistic regression output, we observe that ca5 is a non-significant predictor (p-value 0.73). However, because ca is categorical by nature, I decided not to modify it, since ca5 can effectively be pooled with the base category. This simplifies interpretation without losing valuable information.
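As a sketch of the overall workflow here — fit a logistic model, then score its discrimination with the AUC — on simulated data (not the report's patient records; coefficients and sample size are invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Simulate a binary outcome from a known logistic relationship.
rng = np.random.default_rng(5)
X = rng.normal(size=(800, 4))
logit = X @ np.array([1.5, -1.0, 0.8, 0.0])
y = (rng.uniform(size=800) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

probs = model.predict_proba(X_te)[:, 1]       # predicted P(disease)
print(round(roc_auc_score(y_te, probs), 2))   # discrimination across all cutoffs
```

The AUC summarizes performance across every possible cutoff, which is why the report can first assess the model by AUC and only afterwards pick a specific operating cutoff such as 0.74.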
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -2.83 | 1.15 | -2.46 | 0.01 |
| sex2 | 2.12 | 0.26 | 8.20 | 0.00 |
| cp2 | -1.28 | 0.29 | -4.41 | 0.00 |
| cp3 | -1.97 | 0.25 | -7.78 | 0.00 |
| cp4 | -2.17 | 0.36 | -6.07 | 0.00 |
| trestbps | 0.02 | 0.01 | 3.74 | 0.00 |
| chol | 0.01 | 0.00 | 3.20 | 0.00 |
| thalach | -0.03 | 0.01 | -4.84 | 0.00 |
| exang2 | 0.94 | 0.23 | 4.07 | 0.00 |
| oldpeak | 0.64 | 0.11 | 5.84 | 0.00 |
| ca2 | 1.93 | 0.25 | 7.80 | 0.00 |
| ca3 | 2.57 | 0.35 | 7.41 | 0.00 |
| ca4 | 2.06 | 0.46 | 4.46 | 0.00 |
| ca5 | -0.28 | 0.82 | -0.34 | 0.73 |
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Logistic Model | 0.88 | 0.92 | 0.84 | 0.88 | 0.91 |
| Pruned Logistic Model | 0.81 | 0.66 | 0.96 | 0.81 | 0.75 |
```
          Truth
Prediction Yes  No
       Yes 504 172
       No   22 327
```
| Best_Cutoff | Sensitivity | Specificity | AUC_for_Model |
|---|---|---|---|
| 0.74 | 0.87 | 0.82 | 0.91 |
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Logistic Model | 0.88 | 0.92 | 0.84 | 0.88 | 0.91 |
| Pruned Logistic Model | 0.81 | 0.66 | 0.96 | 0.81 | 0.75 |
| Logistic Model Cutoff 0.74 | 0.84 | 0.82 | 0.87 | 0.84 | 0.83 |
When predicting the presence of heart disease (target = 1), I coded “Yes” to indicate a diagnosis of heart disease and “No” to indicate no heart disease. For this analysis, I fit both classification trees and logistic regression models to compare their predictive performance. Both classification tree models achieved a sensitivity of 91%, meaning they correctly identified 91% of individuals with heart disease. They also demonstrated high specificity (95%) and an overall accuracy of 93%, making them strong predictive tools. When adjusting the classification threshold, we observed that moving the cutoff to 0.59 had minimal impact on classification results. This occurred because most predicted probabilities were below 0.59, leading to similar classifications and unchanged performance metrics. The precision is 95% and the AUC is 0.98. Overall, both models show strong predictive performance in detecting heart disease.
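The threshold effect described here is easy to demonstrate: when predicted probabilities cluster well below 0.50 or well above 0.59, no case falls between the two cutoffs, so moving the threshold flips nothing. The probabilities below are invented for illustration, loosely mimicking the bimodal probability summary reported later in this section:

```python
import numpy as np

# Invented predicted probabilities: most sit near 0 or near 1,
# with the middle of the distribution below 0.50.
probs = np.array([0.00, 0.01, 0.05, 0.20, 0.46, 0.48, 0.97, 0.98, 0.99, 1.00])

labels_050 = probs >= 0.50   # default cutoff
labels_059 = probs >= 0.59   # adjusted cutoff
changed = int(np.sum(labels_050 != labels_059))
print(changed)  # only cases with 0.50 <= p < 0.59 would flip class
```

Since no case here has a probability in [0.50, 0.59), both cutoffs produce identical labels, which is exactly why the report's metrics were unchanged.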
I will use all the variables. For this model the cost complexity is set to 0.001.
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Logistic Model | 0.88 | 0.92 | 0.84 | 0.88 | 0.91 |
| Pruned Logistic Model | 0.81 | 0.66 | 0.96 | 0.81 | 0.75 |
| Logistic Model Cutoff 0.74 | 0.84 | 0.82 | 0.87 | 0.84 | 0.83 |
| Classification Tree Model | 0.93 | 0.91 | 0.95 | 0.93 | 0.95 |
```
          Truth
Prediction Yes  No
       Yes 481  23
       No   45 476
```
The classification tree has 32 leaf nodes, each representing a final decision point in the model. The more splits the tree has, the more complex its decision-making process becomes. This tree structure helps classify individuals based on various health indicators, ultimately predicting whether a person has heart disease or not. Looking at the Variable Importance Plot (VIP), the higher the importance value, the greater the influence of the variable on the model’s decision-making process. In this case, chest pain type (cp) is the most influential predictor, followed by thalassemia status (thal) and maximum heart rate achieved (thalach). These variables significantly impact the likelihood of heart disease, as they are key indicators used in medical assessments.
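A sketch of how such an importance ranking is read off a fitted tree; scikit-learn's impurity-based importances stand in for R's VIP, and the data are synthetic, with the first feature deliberately made the strongest driver:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data; feature 0 truly drives the outcome.
rng = np.random.default_rng(6)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.5, 500) > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
# Rank features from most to least important, as a VIP does.
ranking = np.argsort(tree.feature_importances_)[::-1]
print(ranking)
```

Importance here measures how much each feature's splits reduce impurity across the tree, so the dominant simulated feature lands at the top of the ranking, just as cp tops the report's VIP.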
The summary of descriptive statistics shows that while the model makes varied predictions, most probabilities are clustered around 0.5, with only the top 25% exceeding 0.97. Since half of the predictions are below 0.46, raising the classification cutoff to 0.59 has minimal impact, as many cases remain classified as “No.” While some cases are confidently predicted as “Yes,” they are relatively rare, leading to unchanged classification results and performance metrics.
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00000 0.01376 0.46154 0.51317 0.97083 1.00000
```
| Best_Cutoff | Sensitivity | Specificity | AUC_for_Model |
|---|---|---|---|
| 0.59 | 0.91 | 0.95 | 0.98 |
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Logistic Model | 0.88 | 0.92 | 0.84 | 0.88 | 0.91 |
| Pruned Logistic Model | 0.81 | 0.66 | 0.96 | 0.81 | 0.75 |
| Logistic Model Cutoff 0.74 | 0.84 | 0.82 | 0.87 | 0.84 | 0.83 |
| Classification Tree Model | 0.93 | 0.91 | 0.95 | 0.93 | 0.95 |
| Classification Tree Model Cutoff 0.59 | 0.93 | 0.91 | 0.95 | 0.93 | 0.95 |
From the first analysis, I can confidently recommend the tuned regression tree model for predicting thalach (maximum heart rate achieved). The tuned regression tree achieves the lowest RMSE (14.56) and MAE (11.28) while achieving the highest R² (0.56), indicating that it explains 56% of the variance in thalach, significantly more than the linear models (39%). The predicted vs. actual plot further supports this conclusion: the tuned regression tree model (purple) aligns more closely with the ideal diagonal line, suggesting better prediction accuracy. A key takeaway from the Variable Importance Plot (VIP) is that ST segment slope (slope), age, and ST depression (oldpeak) are among the most influential predictors for modeling thalach. These insights suggest that understanding these variables can enhance decision-making related to cardiovascular health.
From the second analysis, we compared the classification tree model and logistic regression model to predict the presence of heart disease. The classification tree (cutoff 0.59) consistently outperforms logistic regression on accuracy (93%), specificity (95%), and precision (95%), with comparable sensitivity (91%), ensuring more accurate identification of individuals with and without heart disease. The ROC curve confirms this distinction, with the classification tree having a higher AUC (0.98) compared to logistic regression. For the classification model, the Variable Importance Plot (VIP) highlights the most significant predictors of heart disease. The top predictors are chest pain type (cp), thalassemia status (thal), and maximum heart rate (thalach). These factors strongly influence the classification of heart disease cases and align with established medical risk factors.
Reflection: One of the aspects I am most proud of in this project is my ability to work with a real-life dataset and analyze it using R. Given that this was my first experience coding and using a programming language, I feel a great sense of accomplishment in understanding data manipulation, modeling, and interpretation in R. Throughout the project, I built confidence in my ability to apply statistical techniques, and I am excited to use these skills in future projects.
If I had another week to work on the project, I would focus on enhancing the predictive power of my models. One approach in the logistic regression would be combining predictors, such as integrating ca5 with the base model, to explore whether it improves classification performance. Additionally, I would experiment with other methods, such as Random Forest or Boosting, to compare their accuracy and robustness against the classification tree and logistic regression models. These methods could potentially improve generalizability and further optimize sensitivity and specificity in predicting heart disease.
In addition, comparing the models for predicting thalach, we see that the tuned regression tree further enhances model performance, reducing MAE to 11.28, with an R² of 56% and an RMSE of 14.56.
| model | RMSE | MAE | RSQ |
|---|---|---|---|
| Linear Model Full | 17.894 | 13.951 | 0.394 |
| Linear Pruned Final Model | 17.980 | 14.036 | 0.389 |
| Lasso Regression Penalty 0.1 | 17.897 | 13.948 | 0.394 |
| Lasso Regression Tuned | 17.518 | 13.732 | 0.353 |
| Regression Tree Model | 15.504 | 12.140 | 0.545 |
| Tuned Regression Tree Model | 14.557 | 11.279 | 0.557 |
In predicting heart disease, the classification tree has higher precision (95%) for predicting heart disease cases (Yes), ensuring fewer false positives. It also maintains high sensitivity (91%), correctly identifying most individuals with heart disease, while the logistic regression models trail it on overall accuracy, specificity, and precision. Given these results, the classification tree model is the better choice.
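As a check, the headline metrics can be recomputed directly from the classification tree's confusion matrix shown earlier (481 true "Yes", 23 false "Yes", 45 false "No", 476 true "No"), treating "Yes" (heart disease present) as the positive class:

```python
# Cell counts from the classification tree's confusion matrix above.
tp, fp, fn, tn = 481, 23, 45, 476

accuracy = (tp + tn) / (tp + fp + fn + tn)
sensitivity = tp / (tp + fn)   # recall for the disease class
specificity = tn / (tn + fp)   # correct rejection of healthy cases
precision = tp / (tp + fp)     # reliability of a "Yes" prediction
print(round(accuracy, 2), round(sensitivity, 2),
      round(specificity, 2), round(precision, 2))
```

These work out to 0.93, 0.91, 0.95, and 0.95, reproducing the Classification Tree Model row in the table below.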
| model | Accuracy | Sensitivity | Specificity | Avg_Sens_Spec | Precision |
|---|---|---|---|---|---|
| Logistic Model | 0.88 | 0.92 | 0.84 | 0.88 | 0.91 |
| Pruned Logistic Model | 0.81 | 0.66 | 0.96 | 0.81 | 0.75 |
| Logistic Model Cutoff 0.74 | 0.84 | 0.82 | 0.87 | 0.84 | 0.83 |
| Classification Tree Model | 0.93 | 0.91 | 0.95 | 0.93 | 0.95 |
| Classification Tree Model Cutoff 0.59 | 0.93 | 0.91 | 0.95 | 0.93 | 0.95 |